library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble 3.1.8 ✔ tsibble 1.1.3
## ✔ dplyr 1.1.0 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.0 ✔ feasts 0.3.0
## ✔ lubridate 1.9.2 ✔ fable 0.3.2
## ✔ ggplot2 3.4.1 ✔ fabletools 0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(feasts)
Some common transformations which can be used when modelling were discussed in Section 3.1. When forecasting from a model with transformations, we first produce forecasts of the transformed data. Then, we need to reverse the transformation (or back-transform) to obtain forecasts on the original scale. For Box-Cox transformations given by (3.1), the reverse transformation is given by
\[ \begin{equation} \tag{5.2} y_{t} = \begin{cases} \exp(w_t) & \text{if } \lambda = 0; \\ \text{sign}(\lambda w_t+1)|\lambda w_t+1|^{1/\lambda} & \text{otherwise} \end{cases} \end{equation} \]
The fable package will automatically back-transform the
forecasts whenever a transformation has been used in the model
definition. The back-transformed forecast distribution is then a
“transformed Normal” distribution.
If a transformation has been used, then the prediction interval is first computed on the transformed scale, and the end points are back-transformed to give a prediction interval on the original scale. This approach preserves the probability coverage of the prediction interval, although it will no longer be symmetric around the point forecast.
The back-transformation of prediction intervals is done automatically
when using the fable package, provided you have used a
transformation in the model formula.
Transformations sometimes make little difference to the point forecasts but have a large effect on prediction intervals.
One issue with using mathematical transformations such as Box-Cox transformations is that the back-transformed point forecast will not be the mean of the forecast distribution. In fact, it will usually be the median of the forecast distribution (assuming that the distribution on the transformed space is symmetric). For many purposes, this is acceptable, although the mean is usually preferable. For example, you may wish to add up sales forecasts from various regions to form a forecast for the whole country. But medians do not add up, whereas means do.
For a Box-Cox transformation, the back-transformed mean is given (approximately) by
\[ \begin{equation} \tag{5.3} \hat{y}_{T+h|T} = \begin{cases} \exp(\hat{w}_{T+h|T})\left[1 + \frac{\sigma_h^2}{2}\right] & \text{if } \lambda=0;\\ (\lambda \hat{w}_{T+h|T}+1)^{1/\lambda}\left[1 + \frac{\sigma_h^2(1-\lambda)}{2(\lambda \hat{w}_{T+h|T}+1)^{2}}\right] & \text{otherwise;} \end{cases} \end{equation} \]
where \(\hat{w}_{T+h|T}\) is the \(h\)-step forecast mean and \(\sigma_h^2\) is the \(h\)-step forecast variance on the transformed scale. The larger the forecast variance, the bigger the difference between the mean and the median.
The difference between the simple back-transformed forecast given by (5.2) and the mean given by (5.3) is called the bias. When we use the mean, rather than the median, we say the point forecasts have been bias-adjusted.
To see how much difference this bias-adjustment makes, consider the following example, where we forecast the average annual price of eggs using the drift method with a log transformation \((\lambda = 0)\). The log transformation is useful in this case to ensure the forecasts and the prediction intervals stay positive.
prices |>
filter(!is.na(eggs)) |>
model(RW(log(eggs) ~ drift())) |>
forecast(h = 50) |>
autoplot(prices |> filter(!is.na(eggs)),
level = 80,
point_forecast = lst(mean, median)) +
labs(title = "Annual egg prices",
y = "$US (in cents adjusted for inflation) ")
## Warning in ggplot2::geom_point(mapping = mapping, data =
## dplyr::semi_join(object, : Ignoring unknown aesthetics: linetype
The dashed line in Figure 5.17 shows the forecast medians while the solid line shows the forecast means. Notice how the skewed forecast distribution pulls up the forecast distribution’s mean; this is a result of the added term from the bias adjustment.
Bias-adjusted forecast means are automatically computed in the
fable package. The forecast median (the point forecast
prior to bias adjustment) can be obtained using the
median() function on the distribution column.
\[y_t = \hat{S}_t + \hat{A}_t,\]
where \(\hat{A}_t=\hat{T}_t+\hat{R}_t\) is the seasonally adjusted component.
\[y_t = \hat{S}_t\hat{A}_t,\]
where \(\hat{A}_t = \hat{T}_t\hat{R}_{t}\).
To forecast a decomposed time series, we forecast the seasonal component, \(\hat{S}_t\), and the seasonally adjusted component \(\hat{A}_t\), separately. It is usually assumed that the seasonal component is unchanging, or changing extremely slowly, so it is forecast by simply taking the last year of the estimated component. In other words, a seasonal naïve method is used for the seasonal component.
To forecast the seasonally adjusted component, any non-seasonal forecasting method may be used. For example, the drift method, or Holt’s method (discussed in Chapter 8), or a non-seasonal ARIMA model (discussed in Chapter 9), may be used.
us_retail_employment <- us_employment |>
filter(year(Month) >= 1990, Title == "Retail Trade")
dcmp <- us_retail_employment |>
model(STL(Employed ~ trend(window = 7), robust = TRUE)) |>
components() |>
select(-.model)
dcmp |>
model(NAIVE(season_adjust)) |>
forecast() |>
autoplot(dcmp) +
labs(y = "Number of people",
title = "US retail employment")
Figure 5.18: Naïve forecasts of the seasonally adjusted data obtained from an STL decomposition of the total US retail employment.
Figure 5.18 shows naïve forecasts of the seasonally adjusted US retail employment data. These are then “reseasonalised” by adding in the seasonal naïve forecasts of the seasonal component.
This is made easy with the decomposition_model()
function, which allows you to compute forecasts via any
additive decomposition, using other model functions to forecast each of
the decomposition’s components. Seasonal components of the model will be
forecast automatically using SNAIVE() if a different model
isn’t specified. The function will also do the reseasonalising for you,
ensuring that the resulting forecasts of the original data are obtained.
These are shown in Figure 5.19.
fit_dcmp <- us_retail_employment |>
model(stlf = decomposition_model(
STL(Employed ~ trend(window = 7), robust = TRUE),
NAIVE(season_adjust)
))
fit_dcmp |>
forecast() |>
autoplot(us_retail_employment)+
labs(y = "Number of people",
title = "US retail employment")
Figure 5.19: Forecasts of the total US retail employment data based on a naïve forecast of the seasonally adjusted data and a seasonal naïve forecast of the seasonal component, after an STL decomposition of the data.
The prediction intervals shown in this graph are constructed in the same way as the point forecasts. That is, the upper and lower limits of the prediction intervals on the seasonally adjusted data are “reseasonalised” by adding in the forecasts of the seasonal component.
The ACF of the residuals, shown in Figure 5.20, displays significant autocorrelations. These are due to the naïve method not capturing the changing trend in the seasonally adjusted series.
fit_dcmp |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).
The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required. The following points should be noted.
Some references describe the test set as the “hold-out set” because these data are “held out” of the data used for fitting. Other references call the training set the “in-sample data” and the test set the “out-of-sample data”. We prefer to use “training data” and “test data” in this book.
filter() is useful when extracting a
portion of a time series, such as we need when creating training and
test sets. When splitting data into evaluation sets, filtering the index
of the data is particularly useful. For example,
aus_production |> filter(year(Quarter) >= 1995)
extracts all data from 1995 onward. Equivalently,
aus_production |> filter_index("1995 Q1" ~ .)
slice()
allows the use of indices to choose a subset from each group.
aus_production |>
slice(n()-19:0)
extracts the last 20 observations (5 years).
Slice also works with groups, making it possible to subset observations from each key.
aus_retail |>
group_by(State, Industry) |>
slice(1:12)
will subset the first year of data from each time series in the data.
A forecast “error” is the difference between an observed value and its forecast.
\[e_{T+h} = y_{T+h} - \hat{y}_{T+h|T},\]
where the training data are given by \(\{y_1, \dots, y_T\}\) and the test data is given by \(\{y_{T+1}, y_{T+2},\dots\}\).
Note that forecast errors are different from residuals in two ways. First, residuals are calculated on the training set while forecast errors are calculated on the test set. Second, residuals are based on one-step forecasts while forecast errors can involve multi-step forecasts.
We can measure forecast accuracy by summarising the forecast errors in different ways.
The forecast errors are on the same scale as the data. Accuracy measures that are based only on \(e_t\) are therefore scale-dependent and cannot be used to make comparisons between series that involve different units.
The two most commonly used scale-dependent measures are based on the absolute errors or squared errors:
\[ \begin{align*} \text{Mean absolute error: MAE} & = \text{mean}(|e_{t}|),\\ \text{Root mean squared error: RMSE} & = \sqrt{\text{mean}(e_{t}^2)}. \end{align*} \]
When comparing forecast methods applied to a single time series, or to several time series with the same units, the MAE is popular as it is easy to both understand and compute. A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Consequently, the RMSE is also widely used, despite being more difficult to interpret.
The percentage error is given by \(p_t=100e_t/y_t\). Percentage errors have the advantage of being unit-free, and so are frequently used to compare forecast performances between data sets. The most commonly used measure is:
\[\text{Mean absolute percentage error: MAPE} = \text{mean}(|p_{t}|).\]
Measures based on percentage errors have the disadvantage of being infinite or undefined if \(y_t=0\) for any \(t\) in the period of interest, and having extreme values if any \(y_t\) is close to zero. Another problem with percentage errors that is often overlooked is that they assume the unit of measurement has a meaningful zero.3 For example, a percentage error makes no sense when measuring the accuracy of temperature forecasts on either the Fahrenheit or Celsius scales, because temperature has an arbitrary zero point.
They also have the disadvantage that they put a heavier penalty on negative errors than on positive errors. This observation led to the use of the so-called “symmetric” MAPE (sMAPE) proposed by Armstrong (1978, p. 348), which was used in the M3 forecasting competition. It is defined by
\[\text{sMAPE} = \text{mean}\left(200|y_{t} - \hat{y}_{t}|/(y_{t}+\hat{y}_{t})\right).\]
However, if \(y_t\) is close to zero, \(\hat{y}_t\) is also likely to be close to zero. Thus, the measure still involves division by a number close to zero, making the calculation unstable. Also, the value of sMAPE can be negative, so it is not really a measure of “absolute percentage errors” at all.
Hyndman & Koehler (2006) recommend that the sMAPE not be used. It is included here only because it is widely used, although we will not use it in this book.
Scaled errors were proposed by Hyndman & Koehler (2006) as an alternative to using percentage errors when comparing forecast accuracy across series with different units. They proposed scaling the errors based on the training MAE from a simple forecast method.
For a non-seasonal time series, a useful way to define a scaled error uses naïve forecasts:
\[q_{j} = \frac{\displaystyle e_{j}} {\displaystyle\frac{1}{T-1}\sum_{t=2}^T |y_{t}-y_{t-1}|}.\]
Because the numerator and denominator both involve values on the scale of the original data, \(q_j\) is independent of the scale of the data. A scaled error is less than one if it arises from a better forecast than the average one-step naïve forecast computed on the training data. Conversely, it is greater than one if the forecast is worse than the average one-step naïve forecast computed on the training data.
For seasonal time series, a scaled error can be defined using seasonal naïve forecasts
\[q_{j} = \frac{\displaystyle e_{j}} {\displaystyle\frac{1}{T-m}\sum_{t=m+1}^T |y_{t}-y_{t-m}|}.\]
Mean absolute scaled error
\[\text{MASE} = \text{mean}(|q_{j}|).\]
Root mean squared scaled error
\[\text{RMSSE} = \sqrt{\text{mean}(q_{j}^2)},\]
where
\[q^2_{j} = \frac{\displaystyle e^2_{j}} {\displaystyle\frac{1}{T-m}\sum_{t=m+1}^T (y_{t}-y_{t-m})^2},\]
and set \(m=1\) for non-seasonal data
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
beer_train <- recent_production |>
filter(year(Quarter) <= 2007)
beer_fit <- beer_train |>
model(
Mean = MEAN(Beer),
`Naïve` = NAIVE(Beer),
`Seasonal naïve` = SNAIVE(Beer),
Drift = RW(Beer ~ drift())
)
beer_fc <- beer_fit |>
forecast(h = 10)
beer_fc |>
autoplot(
aus_production |> filter(year(Quarter) >= 1992),
level = NULL
) +
labs(y = "Megalitres",
title = "Forecasts for quarterly beer production") +
guides(colour = guide_legend(title = "Forecast"))
Figure 5.21 shows four forecast methods applied to the quarterly Australian beer production using data only to the end of 2007. The actual values for the period 2008–2010 are also shown. We compute the forecast accuracy measures for this period.
accuracy(beer_fc, recent_production)
The accuracy() function will automatically extract the
relevant periods from the data (recent_production in this example) to
match the forecasts when computing the various accuracy measures.
It is obvious from the graph that the seasonal naïve method is best for these data, although it can still be improved, as we will discover later. Sometimes, different accuracy measures will lead to different results as to which forecast method is best. However, in this case, all of the results point to the seasonal naïve method as the best of these four methods for this data set.
To take a non-seasonal example, consider the Google stock price. The following graph shows the closing stock prices from 2015, along with forecasts for January 2016 obtained from three different methods.
google_stock <- gafa_stock |>
filter(Symbol == "GOOG", year(Date) >= 2015) |>
mutate(day = row_number()) |>
update_tsibble(index = day, regular = TRUE)
google_2015 <- google_stock |> filter(year(Date) == 2015)
google_jan_2016 <- google_stock |>
filter(yearmonth(Date) == yearmonth("2016 Jan"))
google_fit <- google_2015 |>
model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = RW(Close ~ drift())
)
google_fc <- google_fit |>
forecast(google_jan_2016)
google_fc |>
autoplot(bind_rows(google_2015, google_jan_2016),
level = NULL) +
labs(y = "$US",
title = "Google closing stock prices from Jan 2015") +
guides(colour = guide_legend(title = "Forecast"))
accuracy(google_fc, google_stock)
Here, the best method is the naïve method (regardless of which accuracy measure is used).
The preceding measures all measure point forecast accuracy. When evaluating distributional forecasts, we need to use some other measures.
Consider the Google stock price example from the previous section. Figure 5.23 shows an 80% prediction interval for the forecasts from the naïve method.
google_fc |>
filter(.model == "Naïve") |>
autoplot(bind_rows(google_2015, google_jan_2016),
level = 80) +
labs(y = "$US",
title = "Google closing stock prices")
The lower limit of this prediction interval gives the 10th percentile (or 0.1 quantile) of the forecast distribution, so we would expect the actual value to lie below the lower limit about 10% of the time, and to lie above the lower limit about 90% of the time. When we compare the actual value to this percentile, we need to allow for the fact that it is more likely to be above than below.
More generally, suppose we are interested in the quantile forecast with probability \(p\) at future time \(t\) , and let this be denoted by \(f_{p,t}\) . That is, we expect the observation \(y_t\) to be less than \(f_{p,t}\) with probability \(p\). For example, the 10th percentile would be \(f_{0.10,t}\). If \(y_t\) denotes the observation at time \(t\), then the Quantile Score is
\[ Q_{p,t} = \begin{cases} 2(1-p)(f_{p,t}-y_t), & \text{if } y_t < f_{p,t} \\ 2p(y_t-f_{p,t}), & \text{if } y_t >= f_{p,t} \end{cases} \]
This is sometimes called the “pinball loss function” because a graph of it resembles the trajectory of a ball on a pinball table. The multiplier of 2 is often omitted, but including it makes the interpretation a little easier. A low value of \(Q_{p.t}\) indicates a better estimate of the quantile.
The quantile score can be interpreted like an absolute error. In fact, when \(p=0.5\) , the quantile score \(Q_{0.5,t}\) is the same as the absolute error. For other values of \(p\), the “error” \(y_t-f{p,t}\) is weighted to take account of how likely it is to be positive or negative. If \(p>0.5\), \(Q_{p,t}\) gives a heavier penalty when the observation is greater than the estimated quantile than when the observation is less than the estimated quantile. The reverse is true for \(p<0.5\).
In Figure 5.23, the one-step-ahead 10% quantile forecast (for 4 January 2016) is \(f_{0.1,t}=744.54\) and the observed value is \(y_t=741.84\) Then
\[Q_{0.1,t} = 2(1-0.1) (744.54 - 741.84) = 4.86.\]
This is easily computed using accuracy() with the
quantile_score() function:
google_fc |>
filter(.model == "Naïve", Date == "2016-01-04") |>
accuracy(google_stock, list(qs=quantile_score), probs = 0.10)
It is often of interest to evaluate a prediction interval, rather than a few quantiles, and the Winkler score proposed by Winkler (1972) is designed for this purpose. If the \(100(1-\alpha)%\) prediction interval at time \(t\) is given by \([\ell_{\alpha,t},u_{\alpha,t}]\) then the Winkler score is defined as the length of the interval plus a penalty if the observation is outside the interval:
\[ W_{\alpha,t} = \begin{cases} (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (\ell_{\alpha,t} - y_t) & \text{if } y_t < \ell_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) & \text{if } \ell_{\alpha,t} \le y_t \le u_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (y_t - u_{\alpha,t}) & \text{if } y_t > u_{\alpha,t}. \end{cases} \]
For observations that fall within the interval, the Winkler score is simply the length of the interval. So low scores are associated with narrow intervals. However, if the observation falls outside the interval, the penalty applies, with the penalty proportional to how far the observation is outside the interval.
Prediction intervals are usually constructed from quantiles by setting \(\ell_{\alpha,t}=f_{\alpha/2,t}\) and \(u_{\alpha,t}=f_{1-\alpha/2,t}\). If we add the corresponding quantile scores and divide by \(\alpha\) we get the Winkler score:
\[W_{\alpha,t} = (Q_{\alpha/2,t} + Q_{1-\alpha/2,t})/\alpha.\]
The one-step-ahead 80% interval shown in Figure 5.23 for 4 January 2016 is \([744.54, 773.22]\), and the actual value was 741.84, so the Winkler score is
\[W_{\alpha,t} = (773.22 - 744.54) + \frac{2}{0.2} (744.54 - 741.84) = 55.68. \]
This is easily computed using accuracy() with the
winkler_score() function.
google_fc |>
filter(.model == "Naïve", Date == "2016-01-04") |>
accuracy(google_stock,
list(winkler = winkler_score),
level = 80)
Often we are interested in the whole forecast distribution, rather than particular quantiles or prediction intervals. In that case, we can average the quantile scores over all values of \(p\) to obtain the Continuous Ranked Probability Score or CRPS (Gneiting & Katzfuss, 2014)
In the Google stock price example, we can compute the average CRPS value for all days in the test set. A CRPS value is a little like a weighted absolute error computed from the entire forecast distribution, where the weighting takes account of the probabilities.
google_fc |>
accuracy(google_stock, list(crps = CRPS))
This shows the Naive method giving better distributional forecasts.
As with point forecasts, it is useful to be able to compare the distributional forecast accuracy of several methods across series on different scales. For point forecasts, we used scaled errors for that purpose. Another approach is to use skill scores. These can be used for both point forecast accuracy and distributional forecast accuracy.
With skill scores, we compute a forecast accuracy measure relative to some benchmark method. For example, if we use the naïve method as a benchmark, and also compute forecasts using the drift method, we can compute the CRPS skill score of the drift method relative to the naïve method as
\[ \frac {\text{CRPS}_{\text{Naïve}} - \text{CRPS}_{\text{Drift}}} {\text{CRPS}_{\text{Naïve}}}. \]
This gives the proportion that the drift method improves
over the naïve method based on CRPS. It is easy to compute
using the accuracy() function.
google_fc |>
accuracy(google_stock,
list(skill = skill_score(CRPS)))
Of course, the skill score for the naïve method is 0 because it can’t improve on itself. The other two methods have larger CRPS values than naïve, so the skills scores are negative; the drift method is 26.6% worse than the naïve method.
The skill_score() function will always compute the CRPS for the
appropriate benchmark forecasts, even if these are not included in the
fable object. When the data are seasonal, the benchmark used is the
seasonal naïve method rather than the naïve method. To ensure that the
same training data are used for the benchmark forecasts, it is
important that the data provided to the accuracy() function
starts at the same time as the training data.
The skill_score() function can be used with any accuracy
measure. For example, skill_score(MSE) provides a way of
comparing MSE values across diverse series. However, it is important
that the test set is large enough to allow reliable calculation of the
error measure, especially in the denominator. For that reason,
MASE or RMSSE are often preferable scale-free measures for point
forecast accuracy.
A more sophisticated version of training/test sets is time series cross-validation. In this procedure, there are a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast. Since it is not possible to obtain a reliable forecast based on a small training set, the earliest observations are not considered as test sets.
The following diagram illustrates the series of training and test sets, where the blue observations form the training sets, and the orange observations form the test sets.
The forecast accuracy is computed by averaging over the test sets. This procedure is sometimes known as “evaluation on a rolling forecasting origin” because the “origin” at which the forecast is based rolls forward in time.
With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used. Suppose that we are interested in models that produce good \(4\)-step-ahead forecasts. Then the corresponding diagram is shown below.
In the following example, we compare the accuracy obtained via time
series cross-validation with the residual accuracy. The
stretch_tsibble() function is used to create many training
sets. In this example, we start with a training set of length
.init=3, and increasing the size of successive training
sets by .step=1.
google_2015_tr <- google_2015 |>
stretch_tsibble(.init = 3, .step = 1) |>
relocate(Date, Symbol, .id)
google_2015_tr
The .id column provides a new key indicating the various training sets. The accuracy() function can be used to evaluate the forecasts accuracy across the training sets.
** TSCV accuracy
google_2015_tr |>
model(RW(Close ~ drift())) |>
forecast(h = 1) |>
accuracy(google_2015)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 1 observation is missing at 253
google_2015 |>
model(RW(Close ~ drift())) |>
accuracy()
| Evaluation method | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| Cross-validation | 11.27 | 7.26 | 1.19 | 1.02 |
| Training | 11.15 | 7.16 | 1.18 | 1.00 |
As expected, the accuracy measures from the residuals are smaller, as the corresponding “forecasts” are based on a model fitted to the entire data set, rather than being true forecasts.
A good way to choose the best forecasting model is to find the model with the smallest RMSE computed using time series cross-validation.